Start with data containing PM2.5, anomalous AOT, smoke day, height in meters above ground level, and pressure in hPa. Data is the result of matching 72-hour trajectories initialized from HYSPLIT points in California 2020 following Brey et al. with EPA station-day-PM2.5 data and then merging with 10 km grid of AOT, mean AOT, and smoke day. Anomalous AOT is AOT - mean AOT.

Aggregation method and cutoff determine how trajectory points were matched to EPA stations.

# Choose aggregation method: nn, idw, or k (idw of knn)
agg <- "idw"

# Choose cutoff: 10, 30, 50, 80, or 100 km
cut <- 30

# Configure data frame based on the above
dat_merged1 <- dat_merged %>% filter(agg_method == agg, cutoff == cut)

Let’s start modeling

We model PM2.5
* \(PM_{2.5} \sim AOT_{anom} + AOT_{anom} \times smokeday\)
* \(PM_{2.5} \sim AOT_{anom} + AOT_{anom} \times smokeday + AOT_{anom} \times smokeday \times height\)
* \(PM_{2.5} \sim AOT_{anom} + AOT_{anom} \times smokeday + AOT_{anom} \times smokeday \times pressure\)

df <- dat_merged1
mdl0 <- lm(pm25 ~ aot_anom + aot_anom:smoke_day,
           data = df)
mdl1 <- lm(pm25 ~ aot_anom + aot_anom:smoke_day + aot_anom:smoke_day:height,
           data = df)
mdl2 <- lm(pm25 ~ aot_anom + aot_anom:smoke_day + aot_anom:smoke_day:pressure,
           data = df)

Model with just anomalous AOT and smoke day does better than models with height or pressure. Based on the last PM2.5 modeling notebook I saw, I think these R2 values are higher? All terms are significant, but effect sizes are small for terms with height or pressure (or maybe they’re not small and it’s just an artifact of units measuring height or pressure). Effect direction for height follows what I would expect, but not pressure (hope I’m recalling science correctly here).

Dependent variable:
pm25
(1) (2) (3)
aot_anom 4.765*** 5.167*** 5.430***
(1.040) (1.512) (1.515)
aot_anom:smoke_day 46.930*** 52.769*** 63.912***
(1.054) (1.588) (2.427)
aot_anom:smoke_day:height -0.004***
(0.0003)
aot_anom:smoke_day:pressure -0.024***
(0.002)
Constant 7.579*** 8.197*** 8.048***
(0.071) (0.120) (0.121)
Observations 92,536 52,928 52,928
R2 0.415 0.400 0.398
Adjusted R2 0.415 0.400 0.397
Residual Std. Error 20.818 (df = 92533) 25.792 (df = 52924) 25.839 (df = 52924)
F Statistic 32,802.810*** (df = 2; 92533) 11,745.030*** (df = 3; 52924) 11,639.840*** (df = 3; 52924)
Note: p<0.1; p<0.05; p<0.01

However, I am not sure if this is the right regression or if we should still include fixed effects of some sort. If I’ve understood correctly, anomalizing AOT accomplishes what FEs had originally been intended to do, which was capture background AOT. But I guess we may also want to capture background PM2.5? So we’ll try that later in the document.

Before that, let’s get predicted PM2.5 and compare the time series from August to Mid-October in the Bay Area.

df$pred_aotanom_smokeday <- predict(mdl0)
df[!is.na(df$height), "pred_aotanom_smokeday_height"] <- predict(mdl1)
df[!is.na(df$pressure), "pred_aotanom_smokeday_pressure"] <- predict(mdl2)
t1 <- as.Date("2020-08-01")
t2 <- as.Date("2020-10-15")
event <- df %>% 
  filter(t1 <= date, date <= t2, cbsa_name == "San Francisco-Oakland-Hayward, CA") %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()

It look like all the models are capturing the big spike and then overestimating the medium spikes sometimes. Not much difference between the model predictions, except we’re missing predictions where height and pressure are missing due to no match between EPA station and trajectory points.

Now we use demeaned PM2.5

We regress PM2.5 on fixed effects for EPA station ID and month of the year (we only have 2020 here) to get demeaned PM2.5 values.

df <- dat_merged1

# Get residuals and fitted values
mdlfe <- feols(pm25 ~ 1 | id_epa + month, data = df)
df$pm25_dmn <- mdlfe$resid
df$fv <- mdlfe$fitted.values

Same models as before but now on demeaned PM2.5.

mdl0 <- lm(pm25_dmn ~ aot_anom + aot_anom:smoke_day,
           data = df)
mdl1 <- lm(pm25_dmn ~ aot_anom + aot_anom:smoke_day + aot_anom:smoke_day:height,
           data = df)
mdl2 <- lm(pm25_dmn ~ aot_anom + aot_anom:smoke_day + aot_anom:smoke_day:pressure,
           data = df)

These R2 values now look more like the ones from the last notebook I saw. R2 is slightly larger for models with height or pressure, and model with height does best. All terms are significant.

Dependent variable:
pm25_dmn
(1) (2) (3)
aot_anom -7.600*** -6.565*** -6.346***
(1.069) (1.506) (1.513)
aot_anom:smoke_day 44.175*** 52.799*** 47.864***
(1.084) (1.582) (2.424)
aot_anom:smoke_day:height -0.006***
(0.0003)
aot_anom:smoke_day:pressure -0.006***
(0.002)
Constant -3.423*** -4.226*** -4.350***
(0.073) (0.120) (0.121)
Observations 92,536 52,928 52,928
R2 0.252 0.264 0.257
Adjusted R2 0.252 0.264 0.257
Residual Std. Error 21.413 (df = 92533) 25.696 (df = 52924) 25.815 (df = 52924)
F Statistic 15,589.270*** (df = 2; 92533) 6,327.997*** (df = 3; 52924) 6,108.691*** (df = 3; 52924)
Note: p<0.1; p<0.05; p<0.01

We add back in the fitted values from the FE regression, get predicted PM2.5 (fitted value from FE reg + residual predicted by our models), and compare the time series from August to Mid-October in the Bay Area.

df$pred_dmn_aotanom_smokeday <- predict(mdl0)
df[!is.na(df$height), "pred_dmn_aotanom_smokeday_height"] <- predict(mdl1)
df[!is.na(df$pressure), "pred_dmn_aotanom_smokeday_pressure"] <- predict(mdl2)

df <- df %>% 
  mutate(across(starts_with("pred_"), as.numeric),
         pred_dmn_aotanom_smokeday = pred_dmn_aotanom_smokeday + fv,
         pred_dmn_aotanom_smokeday_height = pred_dmn_aotanom_smokeday_height + fv,
         pred_dmn_aotanom_smokeday_pressure = pred_dmn_aotanom_smokeday_pressure + fv)
event <- df %>% 
  filter(t1 <= date, date <= t2, cbsa_name == "San Francisco-Oakland-Hayward, CA") %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()

Models capture the big spike, though now slightly underestimating. But now greatly overestimating the big dip in later September. Overestimating the rise in early September. Overestimating the lower PM2.5 values generally. Still have weird inversions in later August, though not as severe as in the previous graph.

It looks like models with just anomalous AOT and smoke day are able to capture the big spike, so I wonder if height is really needed, especially considering the missingness due to matching, or if I’ve done something wrong somewhere…

Update 1

We change to using the 100 km cutoff so that we don’t miss observations across models.

# Re-configure merged data to use max cutoff (100 km)
agg <- "idw"
cut <- 100
dat_merged1 <- dat_merged %>% filter(agg_method == agg, cutoff == cut)

We change our baseline model to
* \(PM_{2.5} \sim AOT_{anom} + smokeday + AOT_{anom} \times smokeday + stationFE + samplemonthFE\)
and compare with
* \(PM_{2.5} \sim AOT_{anom} + smokeday + AOT_{anom} \times smokeday + AOT_{anom} \times smokeday \times height + stationFE + samplemonthFE\)
* \(PM_{2.5} \sim AOT_{anom} + smokeday + AOT_{anom} \times smokeday + AOT_{anom} \times smokeday \times pressure + stationFE + samplemonthFE\)

df <- dat_merged1
mdl0 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day | id_epa + month,
              data = df)
mdl1 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:height | id_epa + month,
              data = df)
mdl2 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:pressure | id_epa + month,
              data = df)

Within R2 is around 0.34 and is highest for model with height, though ultimately not differing much across models. Smoke day term and pressure interaction term are not significant. Interaction term with height is significant, and coefficient goes in expected direction but is small.

etable(mdl0, mdl1, mdl2, drop.section = "fixef")

Let’s get predicted PM2.5 and plot some comparisons.

df$pred_aotanom_smokeday <- predict(mdl0)
df$pred_aotanom_smokeday_height <- predict(mdl1)
df$pred_aotanom_smokeday_pressure <- predict(mdl2)

We filter to smoke days and compare the predictions of each model to observed PM2.5 values. They look very much the same. Seems there’s still quite some unexplained variation in PM2.5.

dfsd <- df %>% 
  filter(smoke_day == 1) %>% 
  pivot_longer(starts_with("pred_"), names_to = "mdl")
ggplot(data = dfsd, mapping = aes(x = value, y = pm25)) + 
  geom_point(aes(color = mdl), alpha = 0.2) + 
  facet_wrap(vars(mdl))

Here’s a correlation plot to confirm that the predictions are very close across the models.

dfsd <- df %>% 
  filter(smoke_day == 1) %>% 
  select(pm25, starts_with("pred_"))
corr_dfsd <- cor(dfsd, use = "complete.obs")
ggcorrplot(corr_dfsd, lab = TRUE)

Let’s try plotting time series where baseline model underestimates.

Looks like Mono county PM2.5 gets underestimated the most.

df$resid0 <- mdl0$residuals
df %>% slice_max(resid0, n = 50) %>% count(county) %>% arrange(desc(n))

The Slink Fire in Mono county was reported on August 29th and last updated October 22nd. So we plot the time series from mid-August through the end of October.

t1 <- as.Date("2020-08-15")
t2 <- as.Date("2020-10-31")
event <- df %>% 
  filter(t1 <= date, date <= t2, county == "Mono") %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()

In addition to the baseline model underestimating PM2.5, our models with height or pressure don’t quite capture most spikes, nor do they differ much from the baseline model.

Let’s also take a look at our runner-up, Lane county.
The Holiday Farm Fire was reported on September 7th and contained on October 29th. So we plot the time series from late August to mid-November.

t1 <- as.Date("2020-08-22")
t2 <- as.Date("2020-11-15")
event <- df %>% 
  filter(t1 <= date, date <= t2, county == "Lane") %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()

Our models underestimate PM2.5, but it’s not as bad as above. Also overestimates PM2.5 levels near zero. Model with height captures spike just before hitting the top of the big spike slightly better than the others.

Let’s take a look at a big fire and see how our models perform.
The August Complex started August 17th and was contained on November 15th. It burned in Glenn, Lake, Mendocino, Tehama, Trinity, and Shasta counties. So we plot the time series from early August to the end of November.

t1 <- as.Date("2020-08-01")
t2 <- as.Date("2020-11-30")
event <- df %>% 
  filter(t1 <= date, date <= t2, 
         county %in% c("Glenn", "Lake", "Mendocino", "Tehama", "Trinity", "Shasta")) %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()

Our models capture the big spike but noticeably overestimate the spike in late August. They also underestimate early August and overestimate a bit in early September. The model predictions overlap a lot, but at a few spikes the model with height seems to estimate a bit higher than the others.

Perhaps there are better ways of analyzing this, but based on R2 and plots, it seems to me that height isn’t making a big difference from our baseline model.

Update 2

We change to 30 km cutoff and fit our baseline model and the model with height using the same sample.

# Re-configure merged data to use 30 km cutoff
agg <- "idw"
cut <- 30
dat_merged1 <- dat_merged %>% filter(agg_method == agg, cutoff == cut)

# Drop obs where missing height
dat_merged1 <- dat_merged1 %>% drop_na(height)
df <- dat_merged1
mdl0 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day | id_epa + month,
              data = df)
mdl1 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:height | id_epa + month,
              data = df)
mdl2 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:pressure | id_epa + month,
              data = df)

Compared to using the 100 km cutoff, within R2 is a bit lower here. The model with height does best, though R2 differences are small. Coefficient on term with height seems similar to above, i.e. right direction but small. Coefficient on term with height is smaller in magnitude now. Also just noticed that coefficient on just smoke day term is negative, not sure if that makes sense.

etable(mdl0, mdl1, mdl2, drop.section = "fixef")

Let’s compare the models’ predicted PM2.5.

df$pred_aotanom_smokeday <- predict(mdl0)
df$pred_aotanom_smokeday_height <- predict(mdl1)
df$pred_aotanom_smokeday_pressure <- predict(mdl2)

We filter to smoke days and compare the predictions using their correlations.

dfsd <- df %>% 
  filter(smoke_day == 1) %>% 
  select(pm25, starts_with("pred_"))
corr_dfsd <- cor(dfsd, use = "complete.obs")
ggcorrplot(corr_dfsd, lab = TRUE)

Doesn’t seem like 30 km cutoff helped height have larger effect.

I wonder how much the choice of initial height parameters influences our results.

Adding here time series plot of August Complex with only observed PM2.5 and baseline model to help see lines more clearly.

t1 <- as.Date("2020-08-01")
t2 <- as.Date("2020-11-30")
event <- df %>% 
  select(-pred_aotanom_smokeday_height, -pred_aotanom_smokeday_pressure) %>% 
  filter(t1 <= date, date <= t2, 
         county %in% c("Glenn", "Lake", "Mendocino", "Tehama", "Trinity", "Shasta")) %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()

Update 3

Limit to California

We limit to EPA stations in California since trajectories were initiated in California and may not represent plumes in other states.

# Drop obs not in California
loc_epa <- dat_merged[c("lon_epa", "lat_epa")] %>% distinct()
us_states <- states() %>% as("Spatial")
us_states <- SpatialPolygonsDataFrame(SpatialPolygons(us_states@polygons), 
                                      us_states@data %>% select(NAME))
o <- over(SpatialPoints(loc_epa), us_states)
dat_merged0 <- dat_merged %>% 
  left_join(bind_cols(loc_epa, o)) %>% 
  rename(state = NAME) %>% 
  filter(state == "California")
agg <- "idw"
cut <- 30
dat_merged1 <- dat_merged0 %>% filter(agg_method == agg, cutoff == cut)
dat_merged1 <- dat_merged1 %>% drop_na(height)
df <- dat_merged1
mdl0 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day | id_epa + month,
              data = df)
mdl1 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:height | id_epa + month,
              data =df)
mdl2 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:pressure | id_epa + month,
              data = df)

Lost a little over half our obs to filtering for California. Smoke day alone coefficient is positive now. Compared to above, coefficient on anomalous AOT is bigger and coefficient on anomalous AOT and smoke day interaction term is smaller for baseline and height models and larger for pressure model. Interaction term with height is no longer statistically significant. Within R2 is lower now. Model with height still performs better than baseline model, but R2 differences are small, and now model with pressure performs best.

etable(mdl0, mdl1, mdl2, drop.section = "fixef")

We get the models’ predicted PM2.5. We stick to total PM2.5 for now. Plots of specifically smoke PM2.5 will come in next section.

df$pred_aotanom_smokeday <- predict(mdl0)
df$pred_aotanom_smokeday_height <- predict(mdl1)
df$pred_aotanom_smokeday_pressure <- predict(mdl2)

Predictions still very similar, though model with pressure more different than model with height when compared against baseline.

dfsd <- df %>% 
  filter(smoke_day == 1) %>% 
  select(pm25, starts_with("pred_"))
corr_dfsd <- cor(dfsd, use = "complete.obs")
ggcorrplot(corr_dfsd, lab = TRUE)

Plotting time series for Bay Area anew.

event <- df %>% 
  filter(t1 <= date, date <= t2, cbsa_name == "San Francisco-Oakland-Hayward, CA") %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()

Models underestimate the big peak, but model with height and baseline get a bit more of the top spike than does model with pressure. Medium peaks somewhat captured. Lower levels of PM2.5 get overestimated a bit.

Mono county gets underestimated the most again.

df$resid0 <- mdl0$residuals
df %>% slice_max(resid0, n = 50) %>% count(county) %>% arrange(desc(n))

Slink Fire again.

t1 <- as.Date("2020-08-15")
t2 <- as.Date("2020-10-31")
event <- df %>% 
  filter(t1 <= date, date <= t2, county == "Mono") %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()

Model with pressure captures peaks a bit better. All models overestimate low levels of PM2.5.

August Complex again.

t1 <- as.Date("2020-08-01")
t2 <- as.Date("2020-11-30")
event <- df %>% 
  filter(t1 <= date, date <= t2, 
         county %in% c("Glenn", "Lake", "Mendocino", "Tehama", "Trinity", "Shasta")) %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()

Model with pressure captures big peak best. All models overestimate late August peak and underestimate late September peaks.

IDW average within-trajectory minimum height

Now we use IDW average of daily within-trajectory minimum heights within buffer zone. In other words, for each EPA station-day-PM2.5 combination, first found trajectory points on the same day within a set radius, next grouped by trajectory ID and found trajectory point with minimum height (maximum pressure) for each trajectory, and then IDW average the height (pressure) values across trajectories.

# New option "extr" for using within-trajectory extrema
agg <- "extr"
cut <- 30
dat_merged1 <- dat_merged0 %>% filter(agg_method == agg, cutoff == cut)
dat_merged1 <- dat_merged1 %>% drop_na(height)
df <- dat_merged1
mdl0 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day | id_epa + month,
              data = df)
mdl1 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:height | id_epa + month,
              data = df)
mdl2 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:pressure | id_epa + month,
              data  = df)

Results are pretty similar to above. So the change in matching method doesn’t seem to make a big difference.

etable(mdl0, mdl1, mdl2, drop.section = "fixef")

Now we predict and plot smoke PM2.5. Not sure if FEs are also supposed to go in here so just not including for now.

smoke_day <- df$smoke_day
aot_anom <- df$aot_anom
height <- df$height
pressure <- df$pressure

coefs <- coefficients(mdl0)
df$pred_aotanom_smokeday <- coefs["smoke_day"]*smoke_day + 
  coefs["aot_anom:smoke_day"]*aot_anom*smoke_day
coefs <- coefficients(mdl1)
df$pred_aotanom_smokeday_height <- coefs["smoke_day"]*smoke_day + 
  coefs["aot_anom:smoke_day"]*aot_anom*smoke_day + 
  coefs["aot_anom:smoke_day:height"]*aot_anom*smoke_day*height
coefs <- coefficients(mdl2)
df$pred_aotanom_smokeday_pressure <- coefs["smoke_day"]*smoke_day + 
  coefs["aot_anom:smoke_day"]*aot_anom*smoke_day + 
  coefs["aot_anom:smoke_day:pressure"]*aot_anom*smoke_day*pressure

We filter to smoke days and compare the predictions of each model to total PM2.5 values. Pressure model seems to do slightly better, but predictions don’t get past 150 micrograms, so seems we probably won’t be able to capture big spikes.

dfsd <- df %>% 
  filter(smoke_day == 1) %>% 
  pivot_longer(starts_with("pred_"), names_to = "mdl")
ggplot(data = dfsd, mapping = aes(x = value, y = pm25)) + 
  geom_point(aes(color = mdl), alpha = 0.2) + 
  facet_wrap(vars(mdl))

Bay Area again. Since we’re predicting smoke PM2.5 and not total PM2.5, predictions are lower now. Pressure model does worse than baseline and height models in late September/early October, but otherwise the models are pretty similar.

event <- df %>% 
  filter(t1 <= date, date <= t2, cbsa_name == "San Francisco-Oakland-Hayward, CA") %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()

Pressure model predictions differ from baseline model predictions by at most about 61 micrograms, though this is uncommon.

diff_height <- df$pred_aotanom_smokeday_height - df$pred_aotanom_smokeday
summary(diff_height)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-16.45710   0.00000   0.00000  -0.01706   0.00000  10.47510 
quantile(diff_height, c(0.01, 0.05, 0.95, 0.99))
        1%         5%        95%        99% 
-2.9539452 -0.7675867  0.6609877  2.7859131 
diff_pressure <- df$pred_aotanom_smokeday_pressure - df$pred_aotanom_smokeday
summary(diff_pressure)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-52.71067   0.00000   0.00000   0.05918   0.00000  61.10585 
quantile(diff_pressure, c(0.01, 0.05, 0.95, 0.99))
       1%        5%       95%       99% 
-9.887987 -2.609854  2.780110 11.325619 

From the above, height doesn’t generally seem to help much.

Also took a look at Brey et al. again and a few questions popped up since we follow them up to initializing trajectories but not after that:
- Should we still try to match trajectory points to plumes? Brey et al. throws away trajectories that don’t overlap a plume in the first 48 hours. Currently, we don’t do any overlapping with plume, but if there’s a lot of false positives, it could help to overlap trajectories with plumes first.
- Brey et al. throws away trajectory points once they reach 0 meters in height. I’m not sure why they do this, but is this something we should consider doing? It is possible for particles to sit on the ground for a while and then pick back up into the atmosphere (very briefly looked into this in the data before out of curiosity).

Update 4

EPA sometimes records multiple PM2.5 values for the same station-day, so let’s try averaging each station-day’s PM2.5 values and see if that changes anything.

dat_merged1 <- dat_merged0 %>% 
  filter(agg_method == agg, cutoff == cut) %>% 
  drop_na(height)
dat_merged1 <- dat_merged0 %>% 
  filter(agg_method == agg, cutoff == cut) %>% 
  drop_na(height)
dfpm <- dat_merged1 %>% 
  group_by(date, id_epa) %>% 
  summarize(pm25 = mean(pm25)) %>% 
  ungroup()
`summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
dat_merged1 <- dat_merged0 %>% 
  filter(agg_method == agg, cutoff == cut) %>% 
  drop_na(height)
dfpm <- dat_merged1 %>% 
  group_by(date, id_epa) %>% 
  summarize(pm25 = mean(pm25)) %>% 
  ungroup()
`summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
dat_merged1 <- dat_merged1 %>% 
  select(-pm25) %>% 
  distinct() %>% 
  left_join(dfpm)
Joining, by = c("date", "id_epa")
df <- dat_merged1
mdl0 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day | id_epa + month,
              data = df)
mdl1 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:height | id_epa + month,
              data = df)
mdl2 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:pressure | id_epa + month,
              data = df)
etable(mdl0, mdl1, mdl2, drop.section = "fixef")

Doesn’t look much different from above.

---
title: "Does Plume Height from HYSPLIT Improve Daily-Level PM2.5 Prediction?"
output: html_notebook
---

```{r include = FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
library(dplyr)
library(tidyr)
library(lubridate)
library(fixest)
library(stargazer)
library(ggplot2)
library(ggcorrplot)
library(sp)
library(tigris)

# Read in merged data
path_dropbox <- "/Users/jessssli/BurkeLab Dropbox/Data/"
dat_merged <- readRDS(paste0(path_dropbox, "hms_hysplit/trajectories/trajectories_EPA_smoke_AOT_CA_2020.rds")) %>% 
  mutate(date = as.Date(date), 
         year = year(date),
         month = month(date),
         day = day(date))
```

Start with data containing PM2.5, anomalous AOT, smoke day, height in meters 
above ground level, and pressure in hPa. Data is the result of matching 72-hour 
trajectories initialized from HYSPLIT points in California 2020 following
Brey et al. with EPA station-day-PM2.5 data and then merging with 10 km grid of
AOT, mean AOT, and smoke day. Anomalous AOT is AOT - mean AOT.

Aggregation method and cutoff determine how trajectory points were matched to
EPA stations.

```{r}
# Choose aggregation method: nn, idw, or k (idw of knn)
agg <- "idw"

# Choose cutoff: 10, 30, 50, 80, or 100 km
cut <- 30

# Configure data frame based on the above
dat_merged1 <- dat_merged %>% filter(agg_method == agg, cutoff == cut)
```

## Let's start modeling

We model PM2.5  
  * $PM_{2.5} \sim AOT_{anom} + AOT_{anom} \times smokeday$  
  * $PM_{2.5} \sim AOT_{anom} + AOT_{anom} \times smokeday + AOT_{anom} \times smokeday \times height$  
  * $PM_{2.5} \sim AOT_{anom} + AOT_{anom} \times smokeday + AOT_{anom} \times smokeday \times pressure$  

```{r}
df <- dat_merged1
mdl0 <- lm(pm25 ~ aot_anom + aot_anom:smoke_day,
           data = df)
mdl1 <- lm(pm25 ~ aot_anom + aot_anom:smoke_day + aot_anom:smoke_day:height,
           data = df)
mdl2 <- lm(pm25 ~ aot_anom + aot_anom:smoke_day + aot_anom:smoke_day:pressure,
           data = df)
```

Model with just anomalous AOT and smoke day does better than models with height
or pressure. Based on the last PM2.5 modeling notebook I saw, I think these
R2 values are higher? All terms are significant, but effect sizes are small for
terms with height or pressure (or maybe they're not small and it's just an 
artifact of units measuring height or pressure). Effect direction for height 
follows what I would expect, but not pressure (hope I'm recalling science 
correctly here).

```{r echo = FALSE, results = 'asis'}
suppressWarnings(stargazer(mdl0, mdl1, mdl2, type = "html"))
```


However, I am not sure if this is the right regression or if we
should still include fixed effects of some sort. If I've understood correctly,
anomalizing AOT accomplishes what FEs had originally been intended to do, which
was capture background AOT. But I guess we may also want to capture background 
PM2.5? So we'll try that later in the document.

Before that, let's get predicted PM2.5 and compare the time series from August 
to Mid-October in the Bay Area.

```{r}
df$pred_aotanom_smokeday <- predict(mdl0)
df[!is.na(df$height), "pred_aotanom_smokeday_height"] <- predict(mdl1)
df[!is.na(df$pressure), "pred_aotanom_smokeday_pressure"] <- predict(mdl2)
```

```{r}
t1 <- as.Date("2020-08-01")
t2 <- as.Date("2020-10-15")
event <- df %>% 
  filter(t1 <= date, date <= t2, cbsa_name == "San Francisco-Oakland-Hayward, CA") %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()
```

It look like all the models are capturing the big spike and then overestimating 
the medium spikes sometimes. Not much difference between the model predictions, 
except we're missing predictions where height and pressure are missing due to no 
match between EPA station and trajectory points.

## Now we use demeaned PM2.5

We regress PM2.5 on fixed effects for EPA station ID and month of the year (we 
only have 2020 here) to get demeaned PM2.5 values.

```{r}
df <- dat_merged1

# Get residuals and fitted values
mdlfe <- feols(pm25 ~ 1 | id_epa + month, data = df)
df$pm25_dmn <- mdlfe$resid
df$fv <- mdlfe$fitted.values
```

Same models as before but now on demeaned PM2.5.

```{r}
mdl0 <- lm(pm25_dmn ~ aot_anom + aot_anom:smoke_day,
           data = df)
mdl1 <- lm(pm25_dmn ~ aot_anom + aot_anom:smoke_day + aot_anom:smoke_day:height,
           data = df)
mdl2 <- lm(pm25_dmn ~ aot_anom + aot_anom:smoke_day + aot_anom:smoke_day:pressure,
           data = df)
```

These R2 values now look more like the ones from the last notebook I saw. R2 is 
slightly larger for models with height or pressure, and model with height does 
best. All terms are significant.

```{r echo = FALSE, results = 'asis'}
suppressWarnings(stargazer(mdl0, mdl1, mdl2, type = "html"))
```

We add back in the fitted values from the FE regression, get predicted PM2.5 
(fitted value from FE reg + residual predicted by our models), and compare the 
time series from August to Mid-October in the Bay Area.

```{r}
df$pred_dmn_aotanom_smokeday <- predict(mdl0)
df[!is.na(df$height), "pred_dmn_aotanom_smokeday_height"] <- predict(mdl1)
df[!is.na(df$pressure), "pred_dmn_aotanom_smokeday_pressure"] <- predict(mdl2)

df <- df %>% 
  mutate(across(starts_with("pred_"), as.numeric),
         pred_dmn_aotanom_smokeday = pred_dmn_aotanom_smokeday + fv,
         pred_dmn_aotanom_smokeday_height = pred_dmn_aotanom_smokeday_height + fv,
         pred_dmn_aotanom_smokeday_pressure = pred_dmn_aotanom_smokeday_pressure + fv)
```

```{r}
event <- df %>% 
  filter(t1 <= date, date <= t2, cbsa_name == "San Francisco-Oakland-Hayward, CA") %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()
```

Models capture the big spike, though now slightly underestimating. But now 
greatly overestimating the big dip in later September. Overestimating the rise 
in early September. Overestimating the lower PM2.5 values generally. Still have 
weird inversions in later August, though not as severe as in the previous graph.

It looks like models with just anomalous AOT and smoke day are able to
capture the big spike, so I wonder if height is really needed, especially 
considering the missingness due to matching, or if I've done something wrong 
somewhere...

## Update 1

We change to using the 100 km cutoff so that we don't miss observations 
across models.

```{r}
# Re-configure merged data to use max cutoff (100 km)
agg <- "idw"
cut <- 100
dat_merged1 <- dat_merged %>% filter(agg_method == agg, cutoff == cut)
```

We change our baseline model to  
* $PM_{2.5} \sim AOT_{anom} + smokeday + AOT_{anom} \times smokeday + stationFE + samplemonthFE$  
and compare with  
* $PM_{2.5} \sim AOT_{anom} + smokeday + AOT_{anom} \times smokeday + AOT_{anom} \times smokeday \times height + stationFE + samplemonthFE$  
* $PM_{2.5} \sim AOT_{anom} + smokeday + AOT_{anom} \times smokeday + AOT_{anom} \times smokeday \times pressure + stationFE + samplemonthFE$  

```{r}
df <- dat_merged1
mdl0 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day | id_epa + month,
              data = df)
mdl1 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:height | id_epa + month,
              data = df)
mdl2 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:pressure | id_epa + month,
              data = df)
```

Within R2 is around 0.34 and is highest for model with height, though ultimately
not differing much across models. Smoke day term and pressure interaction term 
are not significant. Interaction term with height is significant, and coefficient 
goes in expected direction but is small.

```{r}
etable(mdl0, mdl1, mdl2, drop.section = "fixef")
```

Let's get predicted PM2.5 and plot some comparisons.

```{r}
df$pred_aotanom_smokeday <- predict(mdl0)
df$pred_aotanom_smokeday_height <- predict(mdl1)
df$pred_aotanom_smokeday_pressure <- predict(mdl2)
```

We filter to smoke days and compare the predictions of each model to observed
PM2.5 values. They look very much the same. Seems there's still quite some
unexplained variation in PM2.5.

```{r}
dfsd <- df %>% 
  filter(smoke_day == 1) %>% 
  pivot_longer(starts_with("pred_"), names_to = "mdl")
ggplot(data = dfsd, mapping = aes(x = value, y = pm25)) + 
  geom_point(aes(color = mdl), alpha = 0.2) + 
  facet_wrap(vars(mdl))
```

Here's a correlation plot to confirm that the predictions are very close across
the models.

```{r}
dfsd <- df %>% 
  filter(smoke_day == 1) %>% 
  select(pm25, starts_with("pred_"))
corr_dfsd <- cor(dfsd, use = "complete.obs")
ggcorrplot(corr_dfsd, lab = TRUE)
```

Let's try plotting time series where baseline model underestimates.

Looks like Mono county PM2.5 gets underestimated the most.

```{r}
df$resid0 <- mdl0$residuals
df %>% slice_max(resid0, n = 50) %>% count(county) %>% arrange(desc(n))
```

The [Slink Fire](https://inciweb.nwcg.gov/incident/7105/) in Mono county was 
reported on August 29th and last updated October 22nd. So we plot the time series 
from mid-August through the end of October.

```{r}
t1 <- as.Date("2020-08-15")
t2 <- as.Date("2020-10-31")
event <- df %>% 
  filter(t1 <= date, date <= t2, county == "Mono") %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()
```

In addition to the baseline model underestimating PM2.5, our models with height 
or pressure don't quite capture most spikes, nor do they differ much from the 
baseline model.

Let's also take a look at our runner-up, Lane county.  
The [Holiday Farm Fire](https://inciweb.nwcg.gov/incident/7170/) was reported on 
September 7th and contained on October 29th. So we plot the time series from late 
August to mid-November.

```{r}
t1 <- as.Date("2020-08-22")
t2 <- as.Date("2020-11-15")
event <- df %>% 
  filter(t1 <= date, date <= t2, county == "Lane") %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()
```

Our models underestimate PM2.5, but it's not as bad as above. Also overestimates
PM2.5 levels near zero. Model with height captures spike just before hitting
the top of the big spike slightly better than the others.

Let's take a look at a big fire and see how our models perform.  
The [August Complex](https://inciweb.nwcg.gov/incident/6983/) started August 17th
and was contained on November 15th. It [burned](https://en.wikipedia.org/wiki/August_Complex_fire) 
in Glenn, Lake, Mendocino, Tehama, Trinity, and Shasta counties. So we plot the 
time series from early August to the end of November.

```{r}
t1 <- as.Date("2020-08-01")
t2 <- as.Date("2020-11-30")
event <- df %>% 
  filter(t1 <= date, date <= t2, 
         county %in% c("Glenn", "Lake", "Mendocino", "Tehama", "Trinity", "Shasta")) %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()
```

Our models capture the big spike but noticeably overestimate the spike in late 
August. They also underestimate early August and overestimate a bit in early 
September. The model predictions overlap a lot, but at a few spikes the model
with height seems to estimate a bit higher than the others.

Perhaps there are better ways of analyzing this, but based on R2 and plots, it
seems to me that height isn't making a big difference from our baseline model.

## Update 2

We change to 30 km cutoff and fit our baseline model and the model with height
using the same sample.

```{r}
# Re-configure merged data to use 30 km cutoff
agg <- "idw"
cut <- 30
dat_merged1 <- dat_merged %>% filter(agg_method == agg, cutoff == cut)

# Drop obs where missing height
dat_merged1 <- dat_merged1 %>% drop_na(height)
```

```{r}
df <- dat_merged1
mdl0 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day | id_epa + month,
              data = df)
mdl1 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:height | id_epa + month,
              data = df)
mdl2 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:pressure | id_epa + month,
              data = df)
```

Compared to using the 100 km cutoff, within R2 is a bit lower here. The model 
with height does best, though R2 differences are small. Coefficient on term with 
height seems similar to above, i.e. right direction but small. Coefficient on 
term with height is smaller in magnitude now. Also just noticed that coefficient 
on just smoke day term is negative, not sure if that makes sense.

```{r}
etable(mdl0, mdl1, mdl2, drop.section = "fixef")
```

Let's compare the models' predicted PM2.5.

```{r}
df$pred_aotanom_smokeday <- predict(mdl0)
df$pred_aotanom_smokeday_height <- predict(mdl1)
df$pred_aotanom_smokeday_pressure <- predict(mdl2)
```

We filter to smoke days and compare the predictions using their correlations.

```{r}
dfsd <- df %>% 
  filter(smoke_day == 1) %>% 
  select(pm25, starts_with("pred_"))
corr_dfsd <- cor(dfsd, use = "complete.obs")
ggcorrplot(corr_dfsd, lab = TRUE)
```

Doesn't seem like 30 km cutoff helped height have larger effect.

I wonder how much the choice of initial height parameters influences our results.

Adding here time series plot of August Complex with only observed PM2.5 and 
baseline model to help see lines more clearly.

```{r}
t1 <- as.Date("2020-08-01")
t2 <- as.Date("2020-11-30")
event <- df %>% 
  select(-pred_aotanom_smokeday_height, -pred_aotanom_smokeday_pressure) %>% 
  filter(t1 <= date, date <= t2, 
         county %in% c("Glenn", "Lake", "Mendocino", "Tehama", "Trinity", "Shasta")) %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()
```

## Update 3

### Limit to California

We limit to EPA stations in California since trajectories were initiated in
California and may not represent plumes in other states.

```{r message = FALSE, results = "hide"}
# Drop obs not in California
loc_epa <- dat_merged[c("lon_epa", "lat_epa")] %>% distinct()
us_states <- states() %>% as("Spatial")
us_states <- SpatialPolygonsDataFrame(SpatialPolygons(us_states@polygons), 
                                      us_states@data %>% select(NAME))
o <- over(SpatialPoints(loc_epa), us_states)
dat_merged0 <- dat_merged %>% 
  left_join(bind_cols(loc_epa, o)) %>% 
  rename(state = NAME) %>% 
  filter(state == "California")
```

```{r}
agg <- "idw"
cut <- 30
dat_merged1 <- dat_merged0 %>% filter(agg_method == agg, cutoff == cut)
dat_merged1 <- dat_merged1 %>% drop_na(height)
```

```{r}
df <- dat_merged1
mdl0 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day | id_epa + month,
              data = df)
mdl1 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:height | id_epa + month,
              data =df)
mdl2 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:pressure | id_epa + month,
              data = df)
```

Lost a little over half our obs to filtering for California. Smoke day alone 
coefficient is positive now. Compared to above, coefficient on anomalous AOT is 
bigger and coefficient on anomalous AOT and smoke day interaction term is 
smaller for baseline and height models and larger for pressure model. 
Interaction term with height is no longer statistically significant. Within R2 
is lower now. Model with height still performs better than baseline model, but 
R2 differences are small, and now model with pressure performs best.

```{r}
etable(mdl0, mdl1, mdl2, drop.section = "fixef")
```

We get the models' predicted PM2.5. We stick to total PM2.5 for now. Plots of
specifically smoke PM2.5 will come in next section.

```{r}
df$pred_aotanom_smokeday <- predict(mdl0)
df$pred_aotanom_smokeday_height <- predict(mdl1)
df$pred_aotanom_smokeday_pressure <- predict(mdl2)
```

Predictions still very similar, though model with pressure more different than
model with height when compared against baseline.

```{r}
dfsd <- df %>% 
  filter(smoke_day == 1) %>% 
  select(pm25, starts_with("pred_"))
corr_dfsd <- cor(dfsd, use = "complete.obs")
ggcorrplot(corr_dfsd, lab = TRUE)
```

Plotting time series for Bay Area anew.

```{r}
event <- df %>% 
  filter(t1 <= date, date <= t2, cbsa_name == "San Francisco-Oakland-Hayward, CA") %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()
```

Models underestimate the big peak, but model with height and baseline get a bit 
more of the top spike than does model with pressure. Medium peaks somewhat 
captured. Lower levels of PM2.5 get overestimated a bit.

Mono county gets underestimated the most again.

```{r}
df$resid0 <- mdl0$residuals
df %>% slice_max(resid0, n = 50) %>% count(county) %>% arrange(desc(n))
```

Slink Fire again.

```{r}
t1 <- as.Date("2020-08-15")
t2 <- as.Date("2020-10-31")
event <- df %>% 
  filter(t1 <= date, date <= t2, county == "Mono") %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()
```

Model with pressure captures peaks a bit better. All models overestimate low
levels of PM2.5.

August Complex again.

```{r}
t1 <- as.Date("2020-08-01")
t2 <- as.Date("2020-11-30")
event <- df %>% 
  filter(t1 <= date, date <= t2, 
         county %in% c("Glenn", "Lake", "Mendocino", "Tehama", "Trinity", "Shasta")) %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()
```

Model with pressure captures big peak best. All models overestimate late August
peak and underestimate late September peaks.

### IDW average within-trajectory minimum height

Now we use IDW average of daily within-trajectory minimum heights within buffer 
zone. In other words, for each EPA station-day-PM2.5 combination, first found 
trajectory points on the same day within a set radius, next grouped by 
trajectory ID and found trajectory point with minimum height (maximum pressure) 
for each trajectory, and then IDW average the height (pressure) values across 
trajectories.

```{r}
# New option "extr" for using within-trajectory extrema
agg <- "extr"
cut <- 30
dat_merged1 <- dat_merged0 %>% filter(agg_method == agg, cutoff == cut)
dat_merged1 <- dat_merged1 %>% drop_na(height)
```

```{r}
df <- dat_merged1
mdl0 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day | id_epa + month,
              data = df)
mdl1 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:height | id_epa + month,
              data = df)
mdl2 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:pressure | id_epa + month,
              data  = df)
```

Results are pretty similar to above. So the change in matching method doesn't 
seem to make a big difference.

```{r}
etable(mdl0, mdl1, mdl2, drop.section = "fixef")
```

Now we predict and plot smoke PM2.5. Not sure if FEs are also supposed to go in 
here so just not including for now.

```{r}
smoke_day <- df$smoke_day
aot_anom <- df$aot_anom
height <- df$height
pressure <- df$pressure

coefs <- coefficients(mdl0)
df$pred_aotanom_smokeday <- coefs["smoke_day"]*smoke_day + 
  coefs["aot_anom:smoke_day"]*aot_anom*smoke_day
coefs <- coefficients(mdl1)
df$pred_aotanom_smokeday_height <- coefs["smoke_day"]*smoke_day + 
  coefs["aot_anom:smoke_day"]*aot_anom*smoke_day + 
  coefs["aot_anom:smoke_day:height"]*aot_anom*smoke_day*height
coefs <- coefficients(mdl2)
df$pred_aotanom_smokeday_pressure <- coefs["smoke_day"]*smoke_day + 
  coefs["aot_anom:smoke_day"]*aot_anom*smoke_day + 
  coefs["aot_anom:smoke_day:pressure"]*aot_anom*smoke_day*pressure
```

We filter to smoke days and compare the predictions of each model to total 
PM2.5 values. Pressure model seems to do slightly better, but predictions don't 
get past 150 micrograms, so seems we probably won't be able to capture big 
spikes.

```{r}
dfsd <- df %>% 
  filter(smoke_day == 1) %>% 
  pivot_longer(starts_with("pred_"), names_to = "mdl")
ggplot(data = dfsd, mapping = aes(x = value, y = pm25)) + 
  geom_point(aes(color = mdl), alpha = 0.2) + 
  facet_wrap(vars(mdl))
```

Bay Area again. Since we're predicting smoke PM2.5 and not total PM2.5, 
predictions are lower now. Pressure model does worse than baseline and height 
models in late September/early October, but otherwise the models are pretty 
similar.

```{r}
event <- df %>% 
  filter(t1 <= date, date <= t2, cbsa_name == "San Francisco-Oakland-Hayward, CA") %>% 
  group_by(date) %>% 
  summarize(across(c(pm25, starts_with("pred")), mean, na.rm = TRUE)) %>% 
  pivot_longer(c(pm25, starts_with("pred_")), names_to = "mdl")

ggplot(data = event, mapping = aes(x = date, y = value)) +
  geom_line(mapping = aes(color = mdl)) +
  theme_classic()
```

Pressure model predictions differ from baseline model predictions by at most 
about 61 micrograms, though this is uncommon.

```{r}
diff_height <- df$pred_aotanom_smokeday_height - df$pred_aotanom_smokeday
summary(diff_height)
quantile(diff_height, c(0.01, 0.05, 0.95, 0.99))

diff_pressure <- df$pred_aotanom_smokeday_pressure - df$pred_aotanom_smokeday
summary(diff_pressure)
quantile(diff_pressure, c(0.01, 0.05, 0.95, 0.99))
```

From the above, height doesn't generally seem to help much.

Also took a look at Brey et al. again and a few questions popped up since we 
follow them up to initializing trajectories but not after that:  
- Should we still try to match trajectory points to plumes? Brey et al. throws 
away trajectories that don't overlap a plume in the first 48 hours. Currently, 
we don't do any overlapping with plume, but if there's a lot of false positives, 
it could help to overlap trajectories with plumes first.  
- Brey et al. throws away trajectory points once they reach 0 meters in height. 
I'm not sure why they do this, but is this something we should consider doing? 
It is possible for particles to sit on the ground for a while and then pick back
up into the atmosphere (very briefly looked into this in the data before out of 
curiosity).

## Update 4

EPA sometimes records multiple PM2.5 values for the same station-day, so let's 
try averaging each station-day's PM2.5 values and see if that changes anything.

```{r}
agg <- "idw"
cut <- 30
dat_merged1 <- dat_merged0 %>% 
  filter(agg_method == agg, cutoff == cut) %>% 
  drop_na(height)
dfpm <- dat_merged1 %>% 
  group_by(date, id_epa) %>% 
  summarize(pm25 = mean(pm25)) %>% 
  ungroup()
dat_merged1 <- dat_merged1 %>% 
  select(-pm25) %>% 
  distinct() %>% 
  left_join(dfpm)

```

```{r}
df <- dat_merged1
mdl0 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day | id_epa + month,
              data = df)
mdl1 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:height | id_epa + month,
              data = df)
mdl2 <- feols(pm25 ~ aot_anom + smoke_day + aot_anom:smoke_day + aot_anom:smoke_day:pressure | id_epa + month,
              data = df)
```

```{r}
etable(mdl0, mdl1, mdl2, drop.section = "fixef")
```

Doesn't look much different from above.


